Open data has been defined as “data and content [that] can be freely used, modified, and shared by anyone for any purpose,” and the number of open data initiatives undertaken by local governments in the United States has increased dramatically over the past decade.
In addition to serving as a key source of data for staff and others internal to local government, these public-facing repositories are also being leveraged by a variety of users, from civic hackers to non-profit organizations, businesses, and individual residents.
Open data portals typically use one of three dominant platforms: ArcGIS Open Data, CKAN, and Socrata. Of these, CKAN is free and open source; ArcGIS Open Data is part of ESRI’s suite of programs that many local governments already use for mapping; and Socrata is billed as a “data as a service” platform with a tiered pricing system.
The majority of city open data portals in the U.S. use Socrata, in part because it is “off the shelf” software. As a result, we can access data from a variety of these open data portals through the Socrata platform’s application programming interface, or API. A standard definition of an API is a set of protocols and tools that allow interaction between two different applications. In practice, an API is essentially a program running on a remote server that standardizes information and requests in a way that servers can understand and manage efficiently.
Please watch the short video below, then proceed with the exercise.
The working directory is the default location where R stores files (output) and looks for files (input); it is typically only relevant when one is not operating within an R Project environment. This is where your datasets, scripts, etc. are found, and the working directory can be any folder. In order to check which folder has been designated as the working directory, enter getwd() in the Console, or click the Files tab in the lower right RStudio panel followed by the More button. If we are working in an R script, we can set the working directory programmatically using the setwd function, or we can use the RStudio Files tab as described in the preceding sentence.
Note that for pathnames, you may use a forward slash / or double backslashes \\, but not a single backslash; otherwise, you will get an error message. This is because the backslash is a special character used in string matching, and so in order to distinguish its use as a special character from its use as a regular old \ we use two backslashes to “escape” its interpretation as a special character. More on regular expressions in a few weeks though…
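As a quick, self-contained illustration of the escaping rule (the path here is made up):

```r
# Each "\\" in the source code is a single backslash in the actual string:
p <- "C:\\Users\\me\\data"
nchar(p)        # 16 characters, not 19
cat(p, "\n")    # prints C:\Users\me\data

# Forward slashes work on Windows too, so these would refer to the same folder:
# setwd("C:/Users/me/data")
# setwd("C:\\Users\\me\\data")
```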
On December 17, 2021 the mayor of San Francisco declared a State of Emergency in the city’s Tenderloin district. This step was taken in response to longstanding issues with public drug use, overdose deaths, and crime in the neighborhood that had recently intensified. From a legal perspective, the declaration reduces bureaucratic hurdles that might slow the implementation of the mayor’s Tenderloin Emergency Intervention Plan.
San Francisco is widely viewed as a tech hub and the city has been among the leaders in the U.S. in opening up its data resources. This lab exercise asks you to apply what we have learned to date to explore the underlying issues that led to the State of Emergency declaration.
In the code chunk below, we will install and load a few packages to help us access data from the City of San Francisco’s open data portal. We will also specify which of the CRAN mirror sites to use when downloading packages programmatically. In this example, I am pointing to the site hosted at Oak Ridge National Laboratory in Tennessee, but any of these sites will work (i.e., they are mirrors with the same content).
Note that inside the curly brackets of the code chunk I have set the
message and warning options to
FALSE which means that this kind of output generated by
executing this code chunk will be displayed in the
Console rather than appearing in the output document
(in our case, an HTML file). You can also set the code chunk options by
clicking the gear icon in the upper right corner of the code chunk.
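These options can also be set once for the entire document rather than chunk by chunk; a minimal sketch using knitr’s chunk-option defaults (typically placed in a setup chunk near the top of the notebook):

```r
# Set message and warning to FALSE for every subsequent chunk in the document;
# individual chunks can still override these defaults in their own headers.
knitr::opts_chunk$set(message = FALSE, warning = FALSE)
```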
options(repos=c(CRAN="https://mirrors.nics.utk.edu/cran/"))
install.packages("RSocrata")
## package 'RSocrata' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\bw6xs\AppData\Local\Temp\RtmpuKFuo1\downloaded_packages
install.packages("janitor")
## package 'janitor' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\bw6xs\AppData\Local\Temp\RtmpuKFuo1\downloaded_packages
library(RSocrata)
library(jsonlite)
library(tidyverse)
library(janitor)
As was mentioned earlier, Socrata is a platform that is used by many local governments to publish open data. The Socrata Open Data API (SODA API) allows us to access these data resources from an R script or notebook in a standardized way. In the video you watched, the API was likened to a waiter or liaison who carries requests and responses back and forth between a client computer (i.e., you) and server computer (i.e., the open data portal). The documentation for the SODA API can be found here and begins with API endpoints, which are essentially Uniform Resource Locators (URLs) that provide access to data. Without getting into too much detail at the moment, we use the Hypertext Transfer Protocol (HTTP) to send requests and receive responses. Most often, we are using the GET method to send queries and retrieve data. For a quick refresher, take another look at this portion of the assigned reading for this week.
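To make the request/response idea concrete before we hand the work off to a helper package, here is a sketch of a raw SODA request URL built by hand; the dataset identifier wg3w-h783 is the police incident dataset used later in this lab, and $limit is a standard SODA query parameter:

```r
# A SODA API request is just a URL: an endpoint plus query parameters
endpoint  <- "https://data.sfgov.org/resource/wg3w-h783.json"
query_url <- paste0(endpoint, "?$limit=5")
query_url

# Retrieving it is a plain HTTP GET, e.g. (requires internet access):
# head_rows <- jsonlite::fromJSON(query_url)
```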
In some cases, an R package exists that allows us to interact with an
API without having to construct queries that conform to the requirements
of the API, and then pass them using a package like httr
(this is what the ungraded exercise that we started in class last time
asked you to do from your web browser). The RSocrata package that we
installed and loaded above does this “dirty work” of formatting the
underlying HTTP queries that allow us to interact with open data portals that use the
Socrata platform… and there are a LOT of them out
there.
Let’s take a closer look…
help(package="RSocrata")
The “Inspect RSocrata Documentation” code chunk simply opens a page
with seven functions that are included in this package. Take a look at
this page (i.e., in the Help tab in the lower-right
corner), then proceed to Exercise 1 below.
Review the documentation page that appears after running the code chunk above.
https://data.sfgov.org
There are more detailed instructions on what to submit and how at the end of this notebook, but essentially you should add code chunks and text chunks (i.e., Markdown sections) to the copy of this R Notebook that you have saved locally that perform the tasks and answer the questions posed above.
Next, return to the documentation page for the RSocrata package. The
read.socrata function can be used to retrieve a dataset
from the San Francisco open data portal. In addition to indexing objects like lists or data frames by their positions using bracket notation, we can also refer to variable names, if they exist, using the $ operator. Let’s sort the
inventory and peruse the types of information available. The
table function provides a count of instances within each
category.
sf_contents <- ls.socrata("https://data.sfgov.org/limitTo=datasets")
names(sf_contents)
head(sf_contents)
sort(sf_contents$title)
table(sf_contents$theme)
sf_contents %>% dplyr::select(title) %>% dplyr::arrange(title)
janitor::tabyl(sf_contents$theme)
How many datasets are there on the San Francisco portal right now?
How many are related to Public Safety? Recall that the default behavior
is for the output of a code chunk to appear below it, and the tabular
results of the four functions that are executed after the
sf_contents data frame object is created can be viewed
there too. The names function displays the column or
attribute names while the head function displays the first
six rows of the sf_contents data frame object. The
sort function places the contents of the specified column or attribute in alphabetical order. The last two lines simply
demonstrate how you might do the same thing while avoiding base R
functions like sort and table.
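As a self-contained illustration of these functions on a toy data frame (the themes below are made up for the example; janitor::tabyl would give proportions as well as counts):

```r
# Toy stand-in for the portal inventory
toy <- data.frame(theme = c("Public Safety", "Housing", "Public Safety",
                            "Transportation", "Housing", "Public Safety"))

table(toy$theme)         # counts of datasets per category
sort(unique(toy$theme))  # the categories in alphabetical order
```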
Tables are nice, but there must be a better way to present this
information. Let’s create a barchart of the portal’s contents
with the ggplot2 package that we used in a practice
exercise last week (we will begin formally working with
tidyverse packages next week). Take a look at the
arguments for the geom_bar function. The code chunk below
also makes use of the ggsave function to export the
barchart to a .png file.
ggplot(data = sf_contents) +
geom_bar(stat = "count", aes(x = as_factor(theme)), fill = "dodgerblue") +
labs(x="", y = "Number of Datasets") +
theme(axis.text.x = element_text(size = 8, angle = 90))
ggsave("Datasets By Theme.png", units = "in", width = 16, height = 8)
Now that we have a sense of what types of datasets are available, we can use what we have learned about subsetting to extract only those datasets related to Public Safety from the larger sf_contents data frame object that represents the holdings of the entire data portal. The line of code below creates a new data frame object that includes only the contents of the San Francisco data portal tagged as Public Safety. Insert a new code chunk or type it directly into the Console to try it yourself.
pubsaf_contents <- sf_contents[sf_contents["theme"] == "Public Safety", ]
sort(pubsaf_contents$title)
Again, the sort function just places things in alphabetical order. Do any of these datasets look like they might be relevant to the State of Emergency in the Tenderloin?
The code chunk below demonstrates how you might pull down a specific
dataset of interest included in the inventory using bracket subsetting
and the dplyr package (your choice).
police_calls_brackets <- sf_contents[sf_contents$title == "Police Department Incident Reports: 2018 to Present", ]
str(police_calls_brackets)
## 'data.frame': 1 obs. of 14 variables:
## $ accessLevel : chr "public"
## $ landingPage : chr "https://data.sfgov.org/d/wg3w-h783"
## $ issued : POSIXct, format: "2021-05-15"
## $ @type : chr "dcat:Dataset"
## $ modified : POSIXct, format: "2023-02-02"
## $ keyword :List of 1
## ..$ : chr "crime" "crime reports" "sfpd" "police department"
## $ contactPoint:'data.frame': 1 obs. of 3 variables:
## ..$ @type : chr "vcard:Contact"
## ..$ fn : chr "OpenData"
## ..$ hasEmail: chr "mailto:no-reply@data.sfgov.org"
## $ publisher :'data.frame': 1 obs. of 2 variables:
## ..$ @type: chr "org:Organization"
## ..$ name : chr "data.sfgov.org"
## $ identifier : chr "https://data.sfgov.org/api/views/wg3w-h783"
## $ description : chr "<strong>A. SUMMARY</strong>\nRead the <u><a href=\"https://datasf.gitbook.io/datasf-dataset-explainers/sfpd-inc"| __truncated__
## $ title : chr "Police Department Incident Reports: 2018 to Present"
## $ distribution:List of 1
## ..$ :'data.frame': 4 obs. of 5 variables:
## .. ..$ @type : chr "dcat:Distribution" "dcat:Distribution" "dcat:Distribution" "dcat:Distribution"
## .. ..$ downloadURL : chr "https://data.sfgov.org/api/views/wg3w-h783/rows.csv?accessType=DOWNLOAD" "https://data.sfgov.org/api/views/wg3w-h783/rows.rdf?accessType=DOWNLOAD" "https://data.sfgov.org/api/views/wg3w-h783/rows.json?accessType=DOWNLOAD" "https://data.sfgov.org/api/views/wg3w-h783/rows.xml?accessType=DOWNLOAD"
## .. ..$ mediaType : chr "text/csv" "application/rdf+xml" "application/json" "application/xml"
## .. ..$ describedBy : chr NA "https://data.sfgov.org/api/views/wg3w-h783/columns.rdf" "https://data.sfgov.org/api/views/wg3w-h783/columns.json" "https://data.sfgov.org/api/views/wg3w-h783/columns.xml"
## .. ..$ describedByType: chr NA "application/rdf+xml" "application/json" "application/xml"
## $ license : chr "http://opendatacommons.org/licenses/pddl/1.0/"
## $ theme : chr "Public Safety"
## - attr(*, "@context")= chr "https://project-open-data.cio.gov/v1.1/schema/catalog.jsonld"
## - attr(*, "@id")= chr "https://data.sfgov.org/data.json"
## - attr(*, "@type")= chr "dcat:Catalog"
## - attr(*, "conformsTo")= chr "https://project-open-data.cio.gov/v1.1/schema"
## - attr(*, "describedBy")= chr "https://project-open-data.cio.gov/v1.1/schema/catalog.json"
class(police_calls_brackets)
## [1] "data.frame"
police_calls_tidyverse <- sf_contents %>%
filter(title == "Police Department Incident Reports: 2018 to Present")
str(police_calls_tidyverse)
## 'data.frame': 1 obs. of 14 variables:
## $ accessLevel : chr "public"
## $ landingPage : chr "https://data.sfgov.org/d/wg3w-h783"
## $ issued : POSIXct, format: "2021-05-15"
## $ @type : chr "dcat:Dataset"
## $ modified : POSIXct, format: "2023-02-02"
## $ keyword :List of 1
## ..$ : chr "crime" "crime reports" "sfpd" "police department"
## $ contactPoint:'data.frame': 1 obs. of 3 variables:
## ..$ @type : chr "vcard:Contact"
## ..$ fn : chr "OpenData"
## ..$ hasEmail: chr "mailto:no-reply@data.sfgov.org"
## $ publisher :'data.frame': 1 obs. of 2 variables:
## ..$ @type: chr "org:Organization"
## ..$ name : chr "data.sfgov.org"
## $ identifier : chr "https://data.sfgov.org/api/views/wg3w-h783"
## $ description : chr "<strong>A. SUMMARY</strong>\nRead the <u><a href=\"https://datasf.gitbook.io/datasf-dataset-explainers/sfpd-inc"| __truncated__
## $ title : chr "Police Department Incident Reports: 2018 to Present"
## $ distribution:List of 1
## ..$ :'data.frame': 4 obs. of 5 variables:
## .. ..$ @type : chr "dcat:Distribution" "dcat:Distribution" "dcat:Distribution" "dcat:Distribution"
## .. ..$ downloadURL : chr "https://data.sfgov.org/api/views/wg3w-h783/rows.csv?accessType=DOWNLOAD" "https://data.sfgov.org/api/views/wg3w-h783/rows.rdf?accessType=DOWNLOAD" "https://data.sfgov.org/api/views/wg3w-h783/rows.json?accessType=DOWNLOAD" "https://data.sfgov.org/api/views/wg3w-h783/rows.xml?accessType=DOWNLOAD"
## .. ..$ mediaType : chr "text/csv" "application/rdf+xml" "application/json" "application/xml"
## .. ..$ describedBy : chr NA "https://data.sfgov.org/api/views/wg3w-h783/columns.rdf" "https://data.sfgov.org/api/views/wg3w-h783/columns.json" "https://data.sfgov.org/api/views/wg3w-h783/columns.xml"
## .. ..$ describedByType: chr NA "application/rdf+xml" "application/json" "application/xml"
## $ license : chr "http://opendatacommons.org/licenses/pddl/1.0/"
## $ theme : chr "Public Safety"
## - attr(*, "@context")= chr "https://project-open-data.cio.gov/v1.1/schema/catalog.jsonld"
## - attr(*, "@id")= chr "https://data.sfgov.org/data.json"
## - attr(*, "@type")= chr "dcat:Catalog"
## - attr(*, "conformsTo")= chr "https://project-open-data.cio.gov/v1.1/schema"
## - attr(*, "describedBy")= chr "https://project-open-data.cio.gov/v1.1/schema/catalog.json"
class(police_calls_tidyverse)
## [1] "data.frame"
We can see that the data frame object from above is fairly complex
and consists of lists, vectors, and data frames. Because I spent some
time looking at the API documentation, the
dataset itself, and the results of the str function
above, I know that the JSON endpoint is
https://data.sfgov.org/resource/wg3w-h783.json and I can
use the fromJSON function to pull this data into my R
Notebook directly from the internet.
police_calls_1 <- fromJSON("https://data.sfgov.org/resource/wg3w-h783.json")
police_calls_tb <- as_tibble(police_calls_1)
str(police_calls_tb)
## tibble [1,000 × 33] (S3: tbl_df/tbl/data.frame)
## $ incident_datetime : chr [1:1000] "2021-07-25T00:00:00.000" "2022-06-28T23:58:00.000" "2022-03-11T10:30:00.000" "2021-05-15T17:47:00.000" ...
## $ incident_date : chr [1:1000] "2021-07-25T00:00:00.000" "2022-06-28T00:00:00.000" "2022-03-11T00:00:00.000" "2021-05-15T00:00:00.000" ...
## $ incident_time : chr [1:1000] "00:00" "23:58" "10:30" "17:47" ...
## $ incident_year : chr [1:1000] "2021" "2022" "2022" "2021" ...
## $ incident_day_of_week : chr [1:1000] "Sunday" "Tuesday" "Friday" "Saturday" ...
## $ report_datetime : chr [1:1000] "2021-07-25T13:41:00.000" "2022-06-28T23:58:00.000" "2022-03-11T20:03:00.000" "2021-05-15T17:47:00.000" ...
## $ row_id : chr [1:1000] "105718906372" "116554371012" "113048071000" "103051807043" ...
## $ incident_id : chr [1:1000] "1057189" "1165543" "1130480" "1030518" ...
## $ incident_number : chr [1:1000] "216105573" "220264913" "226040232" "210183345" ...
## $ report_type_code : chr [1:1000] "II" "VS" "II" "VS" ...
## $ report_type_description : chr [1:1000] "Coplogic Initial" "Vehicle Supplement" "Coplogic Initial" "Vehicle Supplement" ...
## $ filed_online : logi [1:1000] TRUE NA TRUE NA NA TRUE ...
## $ incident_code : chr [1:1000] "06372" "71012" "71000" "07043" ...
## $ incident_category : chr [1:1000] "Larceny Theft" "Other Offenses" "Lost Property" "Recovered Vehicle" ...
## $ incident_subcategory : chr [1:1000] "Larceny Theft - Other" "Other Offenses" "Lost Property" "Recovered Vehicle" ...
## $ incident_description : chr [1:1000] "Theft, Other Property, $50-$200" "License Plate, Recovered" "Lost Property" "Vehicle, Recovered, Motorcycle" ...
## $ resolution : chr [1:1000] "Open or Active" "Open or Active" "Open or Active" "Open or Active" ...
## $ police_district : chr [1:1000] "Southern" "Out of SF" "Central" "Out of SF" ...
## $ cad_number : chr [1:1000] NA NA NA NA ...
## $ intersection : chr [1:1000] NA NA NA NA ...
## $ cnn : chr [1:1000] NA NA NA NA ...
## $ analysis_neighborhood : chr [1:1000] NA NA NA NA ...
## $ supervisor_district : chr [1:1000] NA NA NA NA ...
## $ latitude : chr [1:1000] NA NA NA NA ...
## $ longitude : chr [1:1000] NA NA NA NA ...
## $ point :'data.frame': 1000 obs. of 2 variables:
## ..$ type : chr [1:1000] NA NA NA NA ...
## ..$ coordinates:List of 1000
## .. ..$ : NULL
## .. ..$ : NULL
## .. ..$ : NULL
## .. ..$ : NULL
## .. ..$ : NULL
## .. ..$ : NULL
## .. ..$ : NULL
## .. ..$ : NULL
## .. ..$ : NULL
## .. ..$ : NULL
## .. ..$ : NULL
## .. ..$ : NULL
## .. ..$ : NULL
## .. ..$ : NULL
## .. ..$ : NULL
## .. ..$ : NULL
## .. ..$ : NULL
## .. ..$ : NULL
## .. ..$ : NULL
## .. ..$ : NULL
## .. ..$ : NULL
## .. ..$ : NULL
## .. ..$ : NULL
## .. ..$ : NULL
## .. ..$ : NULL
## .. ..$ : num [1:2] -122.4 37.7
## .. ..$ : num [1:2] -122.4 37.8
## .. ..$ : num [1:2] -122.4 37.8
## .. ..$ : num [1:2] -122.4 37.8
## .. ..$ : num [1:2] -122.4 37.7
## .. ..$ : num [1:2] -122.4 37.7
## .. ..$ : num [1:2] -122.4 37.7
## .. ..$ : num [1:2] -122.4 37.8
## .. ..$ : num [1:2] -122.4 37.8
## .. ..$ : num [1:2] -122.4 37.8
## .. ..$ : num [1:2] -122.4 37.8
## .. ..$ : num [1:2] -122.4 37.8
## .. ..$ : num [1:2] -122.4 37.8
## .. ..$ : NULL
## .. ..$ : num [1:2] -122.4 37.7
## .. ..$ : num [1:2] -122.4 37.7
## .. ..$ : num [1:2] -122.4 37.7
## .. ..$ : num [1:2] -122.4 37.8
## .. ..$ : num [1:2] -122.4 37.7
## .. ..$ : num [1:2] -122.4 37.7
## .. ..$ : num [1:2] -122.4 37.8
## .. ..$ : num [1:2] -122.4 37.8
## .. ..$ : num [1:2] -122.4 37.7
## .. ..$ : num [1:2] -122.4 37.8
## .. ..$ : num [1:2] -122.4 37.7
## .. ..$ : num [1:2] -122.4 37.8
## .. ..$ : num [1:2] -122.4 37.8
## .. ..$ : num [1:2] -122.4 37.7
## .. ..$ : num [1:2] -122.5 37.8
## .. ..$ : num [1:2] -122.4 37.7
## .. ..$ : num [1:2] -122.4 37.8
## .. ..$ : num [1:2] -122.4 37.8
## .. ..$ : num [1:2] -122.4 37.8
## .. ..$ : num [1:2] -122.4 37.8
## .. ..$ : num [1:2] -122.4 37.7
## .. ..$ : num [1:2] -122.4 37.8
## .. ..$ : num [1:2] -122.4 37.8
## .. ..$ : num [1:2] -122.4 37.8
## .. ..$ : num [1:2] -122.4 37.8
## .. ..$ : num [1:2] -122.4 37.8
## .. ..$ : num [1:2] -122.4 37.8
## .. ..$ : num [1:2] -122.5 37.7
## .. ..$ : num [1:2] -122.4 37.8
## .. ..$ : num [1:2] -122.4 37.8
## .. ..$ : num [1:2] -122.4 37.8
## .. ..$ : num [1:2] -122.4 37.8
## .. ..$ : num [1:2] -122.4 37.8
## .. ..$ : num [1:2] -122.4 37.8
## .. ..$ : num [1:2] -122.4 37.8
## .. ..$ : num [1:2] -122.4 37.8
## .. ..$ : num [1:2] -122.4 37.8
## .. ..$ : num [1:2] -122.4 37.8
## .. ..$ : num [1:2] -122.4 37.8
## .. ..$ : NULL
## .. ..$ : num [1:2] -122.4 37.8
## .. ..$ : num [1:2] -122.4 37.8
## .. ..$ : num [1:2] -122.4 37.8
## .. ..$ : num [1:2] -122.4 37.7
## .. ..$ : num [1:2] -122.5 37.8
## .. ..$ : num [1:2] -122.4 37.8
## .. ..$ : NULL
## .. ..$ : num [1:2] -122.4 37.8
## .. ..$ : num [1:2] -122.4 37.8
## .. ..$ : num [1:2] -122.4 37.7
## .. ..$ : num [1:2] -122.4 37.8
## .. ..$ : num [1:2] -122.4 37.7
## .. ..$ : num [1:2] -122.4 37.7
## .. ..$ : num [1:2] -122.4 37.8
## .. ..$ : num [1:2] -122.4 37.8
## .. ..$ : num [1:2] -122.5 37.7
## .. ..$ : num [1:2] -122.5 37.7
## .. ..$ : num [1:2] -122.5 37.8
## .. ..$ : num [1:2] -122.4 37.8
## .. ..$ : num [1:2] -122.4 37.7
## .. .. [list output truncated]
## $ :@computed_region_jwn9_ihcz: chr [1:1000] NA NA NA NA ...
## $ :@computed_region_26cr_cadq: chr [1:1000] NA NA NA NA ...
## $ :@computed_region_qgnn_b9vv: chr [1:1000] NA NA NA NA ...
## $ :@computed_region_nqbw_i6c3: chr [1:1000] NA NA NA NA ...
## $ :@computed_region_h4ep_8xdi: chr [1:1000] NA NA NA NA ...
## $ :@computed_region_n4xg_c4py: chr [1:1000] NA NA NA NA ...
## $ :@computed_region_jg9y_a9du: chr [1:1000] NA NA NA NA ...
We now have 1,000 records (recall that this is the default limit for
some APIs) in a tibble format. If we wanted to write our own API query
we could use the $limit parameter to retrieve more
than the default 1,000 records, but this is okay for our purposes.
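If you did want more rows, a sketch of the paging approach with the SoQL $limit and $offset parameters (the values below are arbitrary, and the GET itself is commented out because it requires internet access):

```r
# Request rows 2001-4000 of the incident dataset in a single page
endpoint <- "https://data.sfgov.org/resource/wg3w-h783.json"
page_url <- paste0(endpoint, "?$limit=2000&$offset=2000")
page_url
# page_2 <- jsonlite::fromJSON(page_url)
```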
Let’s explore the number and timing of police calls to the Tenderloin
neighborhood.
Execute the code chunk above and make sure you understand the
code, then proceed with the tasks below:
Run tabyl(police_calls_tb$analysis_neighborhood) to tally the calls by neighborhood, and compare the other neighborhoods with the Tenderloin in terms of police calls.
Take some time to reflect on this portion of the exercise, then proceed.
Rather than downloading data from the official U.S. Census Bureau website, we can also access data programmatically from R. The first step is to take a quick look at the API documentation, then request an API key by visiting this site. You should receive a response in a few minutes, but be sure to safeguard your API key because it is tied to each individual user.
In order to get a sense for how API queries work, open a web browser and type the following into the search bar, inserting your own personal Census API key where indicated:
https://api.census.gov/data/2021/acs/acs5?get=B01003_001E&for=zip%20code%20tabulation%20area:94114,94110&key=YOUR_KEY_GOES_HERE
The resulting browser window should contain the results of the query above in JavaScript Object Notation (JSON) format. This is a compact way to store and transmit information over the internet that we will return to later in the semester. For now, we just need to know that the response there tells us:
So, we can send queries to the Census API for the 2021 ACS 5-Year Estimates directly from a web browser, but it is more efficient to do this from R. But first of all, what are the different components of the query we just executed?
The https://api.census.gov/data/2021/acs/acs5? component is the base URL. It tells the Census servers that we are interested in 5-Year ACS data ending in 2021 (i.e., ACS 2017-2021).
get=B01003_001E is the variable name (Total population). For a full list, see here.
The for=zip%20code%20tabulation%20area:94114,94110 bit says retrieve values for the census geography ZCTA for the Castro/Noe Valley (94114) and Inner Mission/Bernal Heights (94110) areas of San Francisco. Note that these areas are generally southwest of the Tenderloin. The %20 is a hexadecimal representation of the space character. There are quite a few special characters that need to be properly encoded if they are being passed to the server.
The key=YOUR_KEY_GOES_HERE component is the unique identifier for the client (i.e., your browser/script) that is making the request.
The & character separates the individual parameters of the query.
This structure is specific to the Census APIs and endpoints. If you want to interact with other APIs, you will first have to refer to their documentation and understand how to properly format a query. It is also possible that APIs change over time, causing your code to stop working. Luckily, it is usually not onerous to make the necessary tweaks once you understand how the API has changed.
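The components just described can also be pasted together programmatically, which is essentially what we will do with httr shortly; a sketch (YOUR_KEY_GOES_HERE is a placeholder, not a working key):

```r
# Assemble the same Census API query from its component pieces
base_url <- "https://api.census.gov/data/2021/acs/acs5?"
query <- paste0(base_url,
                "get=B01003_001E",
                "&for=zip%20code%20tabulation%20area:94114,94110",
                "&key=YOUR_KEY_GOES_HERE")
query  # paste into a browser (with a real key) to see the JSON response
```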
Still using a web browser, practice what we have learned here by
constructing the proper API query in the search bar to:
"DP03_0005PE"
which is located in a different API with a slightly
different base URL here&for=place:67000&in=state:06Data Profile API for Unemployment in #2 and #3 Above
Take some time to experiment with one or more of the other variables,
then proceed with the exercise.
Using the examples provided by the Census, we can write some code that retrieves a variety of variables for the City of San Francisco as opposed to specific zip code tabulation areas.
# install.packages("httr")
library(httr)
library(jsonlite)
baseurl <- "https://api.census.gov/data/2018/acs/acs5?"
param1 <- "get="
param2 <- "&for=place:"
param3 <- "&in=state:06"
key <- "&key=f9f31e09fed0f44d23a8a354469e461df98f34cb"
# These are the place FIPS codes for San Francisco, Oakland, and Berkeley
places_list <- paste("67000", "53000", "06000", sep = ",")
vars_list <- c("B01003_001E", "B19013_001E")
vars_names <- c("Total population", "Median household income")
# Use a simple for loop to download two data elements rather than one
for (i in 1:length(vars_list)) {
req <- httr::GET(paste0(baseurl, param1, vars_list[i],
param2, places_list, param3, key))
req_json <- fromJSON(content(req, "text"), flatten=TRUE)
req_tibble <- as_tibble(req_json[2: dim(req_json)[1], ])
colnames(req_tibble) <- req_json[1,]
name <- paste0("these_results_", i)
assign(name, req_tibble)
}
## Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if
## `.name_repair` is omitted as of tibble 2.0.0.
## ℹ Using compatibility `.name_repair`.
# Combine the resulting data frame objects
merged_tibble <- inner_join(these_results_1, these_results_2, by = c("place", "state"))
merged_tibble
The code chunk above uses the httr package to manually
construct HTTP queries to retrieve desired pieces of information from
the Census API. The httr::GET function is analogous to your
web browser in that it sends the fully formatted request to the API. The
paste0 function concatenates strings (e.g., baseurl,
param1, etc.) and the httr::GET function “sends it over”.
The jsonlite::fromJSON function helps us to take the very
messy HTTP response and easily convert it to an R object (e.g., data
frame, tibble, etc.) that we can work with. The assign
function is part of base R and allows us to assign names to objects
created inside a loop based on the counter i which
increments each time we make our way through the code. Finally, the
inner_join function from dplyr allows us
to combine or join the tibble objects created by the loop into a single
tibble object where we use the place and the state
attributes as unique identifiers to do the joining.
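To see the join logic in isolation, here is the same operation on two toy data frames using base R’s merge, which performs an inner join on the shared key columns (the place codes match the lab; the values are made up for illustration):

```r
# Toy versions of the two tibbles produced by the loop
pop    <- data.frame(place = c("67000", "53000", "06000"), state = "06",
                     B01003_001E = c("815000", "434000", "124000"))
income <- data.frame(place = c("67000", "53000"), state = "06",
                     B19013_001E = c("112000", "80000"))

# Inner join: only place/state combinations present in BOTH tables survive,
# so the "06000" row drops out of the result
merged <- merge(pop, income, by = c("place", "state"))
nrow(merged)  # 2
```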
Please submit an R notebook and knitted HTML file that shows your work and responses for each of the Exercises included in this lab exercise (the code chunk directly above is an example and not a formal exercise). Also, briefly comment on your experience with R during this lab exercise. Please upload your report to Collab by 5:00 pm on Thursday February 9th.
This Lab Exercise will be graded on a 100-point scale according to the rubric below:
Length and formatting (10 pts)
Clarity of writing and attention to detail (20 pts)
Technical Content (45 pts)
Reflective Content (25 pts)